55 research outputs found

Optimizing model-agnostic Random Subspace ensembles

Author: Geurts Pierre
Huynh-Thu Vân Anh
Publication date: 20/01/2023

This paper presents a model-agnostic ensemble approach for supervised learning. The proposed approach is based on a parametric version of Random Subspace, in which each base model is learned from a feature subset sampled according to a Bernoulli distribution. Parameter optimization is performed using gradient descent and is rendered tractable by an importance sampling approach that avoids re-training the base models after each gradient descent step. The degree of randomization in our parametric Random Subspace is thus automatically tuned through the optimization of the feature selection probabilities. This is an advantage over the standard Random Subspace approach, where the degree of randomization is controlled by a hyper-parameter. Furthermore, the optimized feature selection probabilities can be interpreted as feature importance scores. Our algorithm can also easily incorporate any differentiable regularization term to impose constraints on these importance scores.
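The two ingredients named in this abstract, Bernoulli sampling of feature subsets and importance sampling to re-use already trained base models, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the self-normalised weighting, and the toy probabilities are assumptions chosen for illustration, and no base models are actually trained here.

```python
import random

def sample_mask(probs, rng):
    """Draw one feature subset: feature j is kept with probability probs[j]."""
    return [1 if rng.random() < p else 0 for p in probs]

def mask_likelihood(mask, probs):
    """Probability of a mask under independent Bernoulli feature selection."""
    lik = 1.0
    for z, p in zip(mask, probs):
        lik *= p if z else (1.0 - p)
    return lik

def importance_weights(masks, old_probs, new_probs):
    """Self-normalised likelihood ratios (an assumed weighting scheme).

    Masks were drawn under old_probs; the weights let them approximate
    an expectation under new_probs without drawing (or training) again.
    """
    raw = [mask_likelihood(m, new_probs) / mask_likelihood(m, old_probs)
           for m in masks]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_frequency(masks, weights, j):
    """Weighted estimate of how often feature j is selected."""
    return sum(w * m[j] for w, m in zip(weights, masks))

# Hypothetical toy setting: three features, masks drawn once under old_probs.
rng = random.Random(0)
old_probs = [0.5, 0.5, 0.5]
masks = [sample_mask(old_probs, rng) for _ in range(200)]

# After a (hypothetical) gradient step the probabilities have moved.
new_probs = [0.9, 0.5, 0.1]
weights = importance_weights(masks, old_probs, new_probs)
```

In a full ensemble, each mask would be paired with a base model trained on the selected features, and the weights would re-weight those models' predictions instead of the selection frequencies shown here.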

arXiv.org e-Print Archive

Context-dependent feature analysis with random forests

Author: Geurts Pierre
Huynh-Thu Vân Anh
Louppe Gilles
Sutera Antonio
Wehenkel Louis
Publication date: 12/05/2016

Feature selection is often more complicated than identifying a single subset of input variables that together explain the output. There may be interactions that depend on contextual information, i.e., variables that turn out to be relevant only in specific circumstances. In this setting, the contribution of this paper is to extend the random forest variable importance framework in order (i) to identify variables whose relevance is context-dependent and (ii) to characterize as precisely as possible the effect of contextual information on these variables. The usage and relevance of our framework for highlighting context-dependent variables are illustrated on both artificial and real datasets. Comment: Accepted for presentation at UAI 201

arXiv.org e-Print Archive

Open Repository and Bibliography - Liège

Statistical interpretation of machine learning-based feature importance scores for biomarker discovery

Author: Geurts Pierre
Huynh-Thu Vân Anh
Saeys Yvan
Wehenkel Louis
Publication venue: Oxford University Press (OUP)
Publication date: 25/04/2012

Motivation: Univariate statistical tests are widely used for biomarker discovery in bioinformatics. These procedures are simple and fast, and their output is easily interpretable by biologists, but they can only identify variables that provide a significant amount of information in isolation from the other variables. As biological processes are expected to involve complex interactions between variables, univariate methods thus potentially miss some informative biomarkers. Variable relevance scores provided by machine learning techniques, however, are potentially able to highlight multivariate interacting effects, but unlike the p-values returned by univariate tests, these relevance scores are usually not statistically interpretable. This lack of interpretability hampers the determination of a relevance threshold for extracting a feature subset from the rankings and also prevents the wide adoption of these methods by practitioners. Results: We evaluated several existing and novel procedures that extract relevant features from rankings derived from machine learning approaches. These procedures replace the relevance scores with measures that can be interpreted in a statistical way, such as p-values, false discovery rates, or family-wise error rates, for which it is easier to determine a significance level. Experiments were performed on several artificial problems as well as on real microarray datasets. Although the methods differ in terms of computing times and the tradeoff they achieve between false positives and false negatives, some of them greatly help in the extraction of truly relevant biomarkers and should thus be of great practical interest for biologists and physicians. As a side conclusion, our experiments also clearly highlight that using model performance as a criterion for feature selection is often counter-productive.
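The abstract does not say which procedures were evaluated, so the following is only a sketch of the general idea, using standard textbook tools as stand-ins: a permutation-style empirical p-value for a relevance score, the Benjamini-Hochberg step-up rule for the false discovery rate, and the Bonferroni correction for the family-wise error rate. All function names and the toy p-values are hypothetical.

```python
def permutation_p_value(observed, null_scores):
    """Empirical p-value of a relevance score against scores obtained
    under the null (e.g. after permuting the output labels)."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (1 + exceed) / (1 + len(null_scores))

def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of features selected while controlling the FDR at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its step-up threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])

def bonferroni(p_values, alpha=0.05):
    """Indices of features selected while controlling the FWER at alpha."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]

# Hypothetical p-values for five candidate biomarkers.
p_values = [0.001, 0.012, 0.014, 0.2, 0.9]
fdr_selected = benjamini_hochberg(p_values)   # features 0, 1 and 2
fwer_selected = bonferroni(p_values)          # feature 0 only
```

The example shows why a statistically interpretable measure makes thresholding easier: the same significance level of 0.05 yields a selection under either error criterion, with the family-wise criterion being the more conservative of the two.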

Open Repository and Bibliography - Liège

Inferring Regulatory Networks from Expression Data Using Tree-Based Methods

Author: Geurts Pierre
Huynh-Thu Vân Anh
Irrthum Alexandre
Wehenkel Louis
Publication venue: Public Library of Science
Publication date: 01/01/2009

One of the pressing open problems of computational systems biology is the elucidation of the topology of genetic regulatory networks (GRNs) using high-throughput genomic data, in particular microarray gene expression data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge aims to evaluate the success of GRN inference algorithms on benchmarks of simulated data. In this article, we present GENIE3, a new algorithm for the inference of GRNs that was the best performer in the DREAM4 In Silico Multifactorial challenge. GENIE3 decomposes the prediction of a regulatory network between p genes into p different regression problems. In each of the regression problems, the expression pattern of one of the genes (target gene) is predicted from the expression patterns of all the other genes (input genes), using the tree-based ensemble methods Random Forests or Extra-Trees. The importance of an input gene in the prediction of the target gene expression pattern is taken as an indication of a putative regulatory link. Putative regulatory links are then aggregated over all genes to provide a ranking of interactions from which the whole network is reconstructed. In addition to performing well on the DREAM4 In Silico Multifactorial challenge simulated data, we show that GENIE3 compares favorably with existing algorithms to decipher the genetic regulatory network of Escherichia coli. It doesn't make any assumption about the nature of gene regulation, can deal with combinatorial and non-linear interactions, produces directed GRNs, and is fast and scalable. In conclusion, we propose a new algorithm for GRN inference that performs well on both synthetic and real gene expression data. The algorithm, based on feature selection with tree-based ensemble methods, is simple and generic, making it adaptable to other types of genomic data and interactions.
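The decomposition described in this abstract (one regression problem per target gene, an importance score per input gene, then a global ranking of links) can be sketched as follows. This is not the GENIE3 code: to stay dependency-free, the tree-based importance is replaced by a plainly named stand-in, the absolute Pearson correlation, and the function names and toy expression values are assumptions.

```python
import math

def abs_correlation_importance(inputs, target_values):
    """Stand-in importance: |Pearson correlation| of each input gene with
    the target. GENIE3 itself uses Random Forests or Extra-Trees here."""
    n = len(target_values)
    mean_t = sum(target_values) / n
    var_t = sum((t - mean_t) ** 2 for t in target_values)
    scores = {}
    for gene, values in inputs.items():
        mean_v = sum(values) / n
        var_v = sum((v - mean_v) ** 2 for v in values)
        cov = sum((v - mean_v) * (t - mean_t)
                  for v, t in zip(values, target_values))
        if var_v == 0 or var_t == 0:
            scores[gene] = 0.0
        else:
            scores[gene] = abs(cov) / math.sqrt(var_v * var_t)
    return scores

def rank_links(expression, importance=abs_correlation_importance):
    """One regression problem per target gene; returns (score, regulator,
    target) tuples sorted from most to least confident link."""
    links = []
    for target, target_values in expression.items():
        # Input genes are all genes except the current target.
        inputs = {g: v for g, v in expression.items() if g != target}
        for regulator, score in importance(inputs, target_values).items():
            links.append((score, regulator, target))
    links.sort(key=lambda link: link[0], reverse=True)
    return links

# Hypothetical expression values for three genes across five samples.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "g3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
links = rank_links(expression)
```

Because correlation is symmetric, this stand-in scores g1 → g2 and g2 → g1 identically; a tree-based importance passed as the `importance` argument would generally give different scores for the two directions, which is what allows directed networks.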

Public Library of Science (PLOS)

Directory of Open Access Journals

PubMed Central

Open Repository and Bibliography - Liège

Inferring gene regulatory networks using ensembles of feature selection techniques

Author: Demeester Piet
Dhaene Tom
Geurts Pierre
Huynh-Thu Vân Anh
Ruyssinck Joeri
Saeys Yvan
Publication date: 01/01/2012

Ghent University Academic Bibliography

From global to local MDI variable importances for random forests and when they are Shapley values

Author: Geurts Pierre
Huynh-Thu Vân Anh
Louppe Gilles
Sutera Antonio
Wehenkel Louis
Publication date: 03/11/2021

Peer reviewed. Random forests have been widely used for their ability to provide so-called importance measures, which give insight at a global (per dataset) level on the relevance of input variables to predict a certain output. On the other hand, methods based on Shapley values have been introduced to refine the analysis of feature relevance in tree-based models to a local (per instance) level. In this context, we first show that the global Mean Decrease of Impurity (MDI) variable importance scores correspond to Shapley values under some conditions. Then, we derive a local MDI importance measure of variable relevance, which has a very natural connection with the global MDI measure and can be related to a new notion of local feature relevance. We further link local MDI importances with Shapley values and discuss them in the light of related measures from the literature. The measures are illustrated through experiments on several classification and regression problems.

arXiv.org e-Print Archive

Open Repository and Bibliography - Liège

OP12 Blood proteins related to immunoregulation or cellular junctions reveal distinct biological profiles associated with the risk of short-term versus mid/long-term relapse in Crohn’s Disease patients stopping infliximab

Author: Allez M.
Bouhnik Y.
Bourreille A.
Colombel J. F.
Huynh-Thu Vân Anh
Laharie D.
Louis Edouard
Meuwis Marie-Alice
Pierre Nicolas
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2021

Open Repository and Bibliography - Liège

Combining tree-based and dynamical systems for the inference of gene regulatory networks

Author: Huynh-Thu Vân Anh
Sanguinetti Guido
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2015

Motivation: Reconstructing the topology of gene regulatory networks (GRNs) from time series of gene expression data remains an important open problem in computational systems biology. Existing GRN inference algorithms face one of two limitations. Model-free methods are scalable but suffer from a lack of interpretability and cannot in general be used for out-of-sample predictions. Model-based methods, on the other hand, focus on identifying a dynamical model of the system. These are clearly interpretable and can be used for predictions; however, they rely on strong assumptions and are typically very demanding computationally. Results: Here, we propose a new hybrid approach for GRN inference, called Jump3, exploiting time series of expression data. Jump3 is based on a formal on/off model of gene expression but uses a non-parametric procedure based on decision trees (called "jump trees") to reconstruct the GRN topology, allowing the inference of networks of hundreds of genes. We show the good performance of Jump3 on in silico and synthetic networks and applied the approach to identify regulatory interactions activated in the presence of interferon gamma. Availability and implementation: Our MATLAB implementation of Jump3 is available at http://homepages.inf.ed.ac.uk/vhuynht/software.html

CiteSeerX

Crossref

PubMed Central

Edinburgh Research Explorer

Sissa Digital Library

Open Repository and Bibliography - Liège

Distinct blood protein profiles associated with the risk of short-term and mid/long-term clinical relapse in patients with Crohn's disease stopping infliximab: when the remission state hides different types of residual disease activity.

Author: Allez Matthieu
Bouhnik Yoram
Bourreille Arnaud
Colombel Jean-Frédéric
GETAID (Groupe d’Etude Thérapeutique des Affections Inflammatoires du tube Digestif)
Huynh-Thu Vân Anh
Laharie David
Louis Edouard
Marichal Thomas
Meuwis Marie-Alice
Pierre Nicolas
Publication venue: BMJ
Publication date: 25/08/2022

Peer reviewed. OBJECTIVE: Despite being in sustained and stable remission, patients with Crohn's disease (CD) stopping anti-tumour necrosis factor α (TNFα) show a high rate of relapse (~50% within 2 years). Characterising non-invasively the biological profiles of those patients is needed to better guide the decision of anti-TNFα withdrawal. DESIGN: Ninety-two immune-related proteins were measured by proximity extension assay in serum of patients with CD (n=102) in sustained steroid-free remission and stopping anti-TNFα (infliximab). As previously shown, a stratification based on time to clinical relapse was used to characterise the distinct biological profiles of relapsers (short-term relapsers: 6 months). Associations between protein levels and time to clinical relapse were determined by univariable Cox model. RESULTS: The risk (HR) of mid/long-term clinical relapse was specifically associated with a high serum level of proteins mainly expressed in lymphocytes (LAG3, SH2B3, SIT1; HR: 2.2-4.5; p<0.05), a low serum level of anti-inflammatory effectors (IL-10, HSD11B1; HR: 0.2-0.3; p<0.05) and cellular junction proteins (CDSN, CNTNAP2, CXADR, ITGA11; HR: 0.4; p<0.05). The risk of short-term clinical relapse was specifically associated with a high serum level of pro-inflammatory effectors (IL-6, IL12RB1; HR: 3.5-3.6; p<0.05) and a low or high serum level of proteins mainly expressed in antigen presenting cells (CLEC4A, CLEC4C, CLEC7A, LAMP3; HR: 0.4-4.1; p<0.05). CONCLUSION: We identified distinct blood protein profiles associated with the risk of short-term and mid/long-term clinical relapse in patients with CD stopping infliximab. These findings constitute an advance for the development of non-invasive biomarkers guiding the decision of anti-TNFα withdrawal.

Open Repository and Bibliography - Liège

dynGENIE3: dynamical GENIE3 for the inference of gene networks from time series expression data

Author: Geurts Pierre
Huynh-Thu Vân Anh
Publication venue: Springer Science and Business Media LLC
Publication date: 21/02/2018

The elucidation of gene regulatory networks is one of the major challenges of systems biology. Measurements about genes that are exploited by network inference methods are typically available either in the form of steady-state expression vectors or time series expression data. In our previous work, we proposed the GENIE3 method, which exploits variable importance scores derived from Random Forests to identify the regulators of each target gene. This method provided state-of-the-art performance on several benchmark datasets, but it could not be applied specifically to time series expression data. We propose here an adaptation of the GENIE3 method, called dynamical GENIE3 (dynGENIE3), for handling both time series and steady-state expression data. The proposed method is evaluated extensively on the artificial DREAM4 benchmarks and on three real time series expression datasets. Although dynGENIE3 does not systematically yield the best performance on each and every network, it is competitive with diverse methods from the literature, while preserving the main advantages of GENIE3 in terms of scalability.

Directory of Open Access Journals

Open Repository and Bibliography - Liège